[test] test_topk_plain: parametrize sweep to fix collection-time OOM by JohnQinAMD · Pull Request #3934 · ROCm/aiter

JohnQinAMD · 2026-06-25T16:04:01Z

Problem

CI runs each op test as a script (python3 <file> in .github/scripts/aiter_test.sh), not via pytest. test_topk_plain.py sweeps 84 cases up to 3072 × 131072 fp32 and never frees tensors between cases, so on a memory-pressured gfx950 runner it OOMs mid-sweep and the whole file exits non-zero — producing intermittent "Standard Tests (MI35X, 8, 5)" failures unrelated to the code under test. On a clean GPU it passes; the failure is purely residual-memory dependent.

Fix (keeps the `python3 <file>` contract)

Free each case's tensors (del + gc.collect() + torch.cuda.empty_cache()) so peak memory is one case, not the whole sweep — the actual OOM fix.
Guard the sweep under if __name__ == "__main__": so import is side-effect-free.
num_iters 1000 → 100 — this is a correctness check, not a perf gate; the high iteration count made it the slowest file in its shard for no benefit.
Vectorize the per-row permutation (drops a batch_size-long Python loop).
Assert on checkAllclose's returned error ratio — it does not raise, so the test previously would have passed even if topk_plain were incorrect.
Summary table falls back to df.to_string when tabulate is absent, so the informational summary can never fail the run.

Validation

python3 op_tests/test_topk_plain.py on MI355X (gfx950): all 84 cases run, all passed~ (err 0), summary prints, exit 0.

Addresses both Copilot review comments (script-style no-op under python3; assert on the error ratio).

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

github-actions · 2026-06-25T16:04:19Z

🏷️ CI Guide

Runs automatically on every PR:

✅ Pre-checks (submodule verification, code formatting)
✅ Aiter op tests (gfx942 + gfx950)
✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label	Tests
`ci:triton-300x`	Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X
`ci:sglang`	SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy
`ci:atom`	ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B
`ci:atom_full`	ATOM accuracy suite for PR and main models from ATOM `models_accuracy.json`
`ci:vllm`	vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5
`ci:all`	All standard extended tests (excludes `ci:atom_full`)

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3934 --add-label <label>

Copilot

Pull request overview

This PR refactors op_tests/test_topk_plain.py to avoid collection/import-time GPU OOMs caused by a module-level sweep that allocates tensors for all cases up front. It converts the sweep into isolated, lazy-per-case execution using pytest parameterization, adds teardown to free GPU memory between cases, and preserves the end-of-run performance summary.

Changes:

Replace the module-level sweep loop with @pytest.mark.parametrize test cases to prevent collection-time allocations and speed up collection.
Add an autouse fixture to GC + torch.cuda.empty_cache() between cases and a session-scoped fixture to emit a single markdown perf summary (with a safe fallback when tabulate is missing).
Vectorize the input permutation generation and reduce iteration counts (1000 → 20) to keep this as a correctness-focused test.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

CI runs each op test as a script (`python3 <file>` in aiter_test.sh), not via pytest. test_topk_plain.py sweeps 84 cases up to 3072x131072 fp32; with no cleanup between cases, a memory-pressured gfx950 runner OOMs mid-sweep and the whole file exits non-zero -> intermittent shard failures unrelated to the code under test. Fix (keeps the `python3 <file>` contract): - free each case's tensors (del + gc + torch.cuda.empty_cache()) so peak memory is one case, not the whole sweep - guard the sweep under `if __name__ == "__main__":` (clean import) - num_iters 1000 -> 100 (correctness check, not a perf gate; was the slowest file in its shard) - vectorize the per-row permutation (drops a batch_size-long Python loop) - assert on checkAllclose's returned error ratio (it does NOT raise) so an incorrect topk_plain actually fails CI instead of silently passing - summary table falls back to df.to_string when `tabulate` is absent Validated: `python3 op_tests/test_topk_plain.py` runs all 84 cases, all pass (err 0), summary prints, exit 0 on MI355X (gfx950). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

zufayu

LGTM — the change is a test quality fix (OOM prevention, add assert on error ratio, reduce iter count). The single CI failure () is a pre-existing issue on main (same in , unrelated to this PR).

zufayu

LGTM. The change is a test quality fix: OOM prevention, adds assert on error ratio, reduces iter count. The single CI failure (Standard Tests MI35X shard 6) is a pre-existing issue on main — same RuntimeError in test_moe_2stage.py, unrelated to this PR.

zufayu · 2026-06-29T04:27:18Z

@valarLip Could you merge this when you get a chance? The CI failure (Standard Tests MI35X shard 6) is a pre-existing issue on main branch (RuntimeError in test_moe_2stage.py, same failure exists on main today), unrelated to this PR's changes.

JohnQinAMD requested review from a team and Copilot June 25, 2026 16:04

Copilot started reviewing on behalf of JohnQinAMD June 25, 2026 16:04 View session

Copilot AI reviewed Jun 25, 2026

View reviewed changes

Comment thread op_tests/test_topk_plain.py Outdated

Comment thread op_tests/test_topk_plain.py Outdated

JohnQinAMD force-pushed the fix/topk-plain-collection-oom branch 4 times, most recently from b427fd5 to 7c7c7f6 Compare June 26, 2026 05:00

zufayu self-requested a review June 26, 2026 08:53

JohnQinAMD force-pushed the fix/topk-plain-collection-oom branch from 7c7c7f6 to 9a2655e Compare June 27, 2026 05:08

zufayu approved these changes Jun 29, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[test] test_topk_plain: parametrize sweep to fix collection-time OOM#3934

[test] test_topk_plain: parametrize sweep to fix collection-time OOM#3934
JohnQinAMD wants to merge 1 commit into
mainfrom
fix/topk-plain-collection-oom

JohnQinAMD commented Jun 25, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 25, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

zufayu left a comment

Uh oh!

zufayu left a comment

Uh oh!

zufayu commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

JohnQinAMD commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix (keeps the python3 <file> contract)

Validation

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Uh oh!

github-actions Bot commented Jun 25, 2026

🏷️ CI Guide

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

zufayu left a comment

Choose a reason for hiding this comment

Uh oh!

zufayu left a comment

Choose a reason for hiding this comment

Uh oh!

zufayu commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

JohnQinAMD commented Jun 25, 2026 •

edited

Loading

Fix (keeps the `python3 <file>` contract)